Establishing a New State-of-the-Art for French Named Entity Recognition
The French TreeBank developed at the University Paris 7 is the main source of
morphosyntactic and syntactic annotations for French. However, it does not
include explicit information related to named entities, which are among the
most useful types of information for many natural language processing tasks
and applications. Moreover, no large-scale French corpus with named entity
annotations contains referential information, which complements the type and
the span of each mention with an indication of the entity it refers to. We
have manually annotated the French TreeBank with such information, after an
automatic pre-annotation step. We sketch the underlying annotation guidelines
and we provide a few figures about the resulting annotations.
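The referential annotation described above pairs each mention's type and span with an identifier for the entity it denotes. A minimal sketch of such a record (the field names and identifiers are illustrative; they do not reproduce the actual French TreeBank annotation schema):

```python
from dataclasses import dataclass

@dataclass
class EntityMention:
    """A named-entity mention: type, character span, and referent.

    Field names are illustrative, not the French TreeBank schema.
    """
    label: str      # entity type, e.g. "PER", "LOC", "ORG"
    start: int      # character offset of the mention start
    end: int        # character offset just past the mention end
    referent: str   # identifier of the entity being referred to

text = "Victor Hugo est né à Besançon."
mentions = [
    EntityMention("PER", 0, 11, "ent-1"),   # "Victor Hugo"
    EntityMention("LOC", 21, 29, "ent-2"),  # "Besançon"
]

# Mentions sharing a referent id denote the same real-world entity.
surface = [text[m.start:m.end] for m in mentions]
```

Two mentions with the same `referent` value corefer, which is exactly the information that type and span alone cannot express.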
JANENG STARCH–CHITOSAN COMBINED WITH THE NATURAL PRESERVATIVES TURMERIC AND ASCORBIC ACID AS AN EDIBLE COATING
Edible coatings are one current solution for food preservation. In this study, an edible coating was prepared from a composite edible film of janeng starch, chitosan and natural preservatives (turmeric and ascorbic acid), and applied to meatballs (bakso) and cheese. The best composite edible film compositions, starch : chitosan : turmeric (1.2% : 0.4% : 0.375%) and starch : chitosan : ascorbic acid (1.2% : 0.4% : 0.5%), selected on the basis of tensile strength, elongation and colour tests, were used for the coating application. Antimicrobial tests showed that edible films combined with turmeric or ascorbic acid inhibited the growth of E. coli with a larger inhibition-zone diameter than the plain janeng starch–chitosan film, namely 7 mm each. Coating the cheese samples reduced microbial growth, inhibited fat oxidation by up to 50% and limited the increase in moisture content to 41.17% over 3 months of storage, compared with uncoated cheese, while coating the meatball samples limited the moisture increase to 20.76%. Sensory analysis of the coated meatballs in terms of aroma, texture and colour suggested that coated meatballs were better than uncoated ones after 3 days of storage. Keywords: edible coating, janeng starch, chitosan, turmeric, ascorbic acid, antimicrobial
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
We use the multilingual OSCAR corpus, extracted from Common Crawl via
language classification, filtering and cleaning, to train monolingual
contextualized word embeddings (ELMo) for five mid-resource languages. We then
compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for
these languages on the part-of-speech tagging and parsing tasks. We show that,
despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on
OSCAR perform much better than monolingual embeddings trained on Wikipedia.
They actually equal or improve the current state of the art in tagging and
parsing for all five languages. In particular, they also improve over
multilingual Wikipedia-based contextual embeddings (multilingual BERT), which
almost always constitutes the previous state of the art, thereby showing that
the benefit of a larger, more diverse corpus surpasses the cross-lingual
benefit of multilingual embedding architectures.
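The OSCAR pipeline described above extracts monolingual text from Common Crawl through language classification, filtering and cleaning. A toy sketch of that shape of pipeline is given below; the heuristics (a stub language scorer, a minimum-length filter, exact deduplication by hashing) are simplified stand-ins, not the real pipeline, which uses a fastText language classifier:

```python
import hashlib

def looks_french(line: str) -> bool:
    """Toy stand-in for a language classifier. The real OSCAR
    pipeline uses fastText language identification instead."""
    french_markers = (" le ", " la ", " les ", " est ", " une ", " des ")
    padded = f" {line.lower()} "
    return any(marker in padded for marker in french_markers)

def clean_corpus(lines):
    """Keep lines passing the language check and a minimum-length
    filter, and drop exact duplicates via content hashing."""
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        if len(line) < 10 or not looks_french(line):
            continue
        digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(line)
    return kept

raw = [
    "Paris est la capitale de la France.",
    "Paris est la capitale de la France.",  # exact duplicate: dropped
    "Hello world, this is English text.",   # fails the language check
    "ok",                                   # too short: dropped
]
cleaned = clean_corpus(raw)
```

Even this crude version shows why the resulting corpus stays noisy: heuristic filters pass borderline lines, which is the noise the abstract says OSCAR-trained embeddings tolerate well.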
A dataset for the automatic detection of place names in modern French texts (Un jeu de données pour la détection automatique de lieux dans les textes français modernes)
How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures
SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German
In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on named entity processing in old newspapers. The challenge proposed various tasks in three languages; among them, we focused on named entity recognition in French and German texts. The best system we proposed ranked third for these two languages; it uses FastText embeddings and ELMo language models (FrELMo and German ELMo). We show that combining several word representations enhances the quality of the results for all NE types, and that sentence segmentation has an important impact on the results.
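One common way to combine several word representations, as the system above does with FastText and ELMo vectors, is to concatenate the per-token embeddings before feeding them to the tagger. A minimal sketch follows; the dimensions and the concatenation strategy are assumptions for illustration, not the team's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-token embeddings for a 6-token sentence:
# a 300-d static FastText vector and a 1024-d contextual ELMo
# vector per token (typical published sizes for these models).
fasttext_vecs = rng.normal(size=(6, 300))
elmo_vecs = rng.normal(size=(6, 1024))

# Concatenating along the feature axis yields one 1324-d vector
# per token; a downstream NER tagger consumes this joint input.
combined = np.concatenate([fasttext_vecs, elmo_vecs], axis=1)
```

The tagger then sees both the static distributional signal and the contextual signal at once, which is one plausible reading of why combining representations helps across all NE types.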
Establishing a New State-of-the-Art for French Named Entity Recognition
Due to the COVID-19 pandemic, the 12th edition of LREC was cancelled; the LREC 2020 proceedings are available at http://www.lrec-conf.org/proceedings/lrec2020/index.html
French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus
This paper describes and compares the impact of different types and sizes of training corpora on language models such as ELMo. Asking the fundamental question of quality versus quantity, we evaluate four French training corpora on downstream parsing, POS tagging and named entity recognition tasks. The paper studies the relevance of a new corpus, CaBeRnet, featuring a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts), in oral and written styles. We hypothesize that a linguistically representative and balanced corpus will allow the language model to be more efficient and representative of a given language, and therefore to yield better evaluation scores on different evaluation sets and tasks.
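Balancing a corpus across genres, as CaBeRnet does, can be sketched as capping each genre at a fixed share of the training data so that no single genre dominates. The genre names, pool sizes and cap below are illustrative, not CaBeRnet's actual composition:

```python
from collections import Counter

def balance_corpus(docs_by_genre, per_genre):
    """Take at most `per_genre` documents from each genre so no
    single genre dominates the training corpus."""
    balanced = []
    for genre, docs in sorted(docs_by_genre.items()):
        balanced.extend((genre, doc) for doc in docs[:per_genre])
    return balanced

# Illustrative pools: news is over-represented in the raw collection.
pools = {
    "news": [f"news-{i}" for i in range(100)],
    "fiction": [f"fic-{i}" for i in range(30)],
    "oral": [f"oral-{i}" for i in range(20)],
}
corpus = balance_corpus(pools, per_genre=20)
counts = Counter(genre for genre, _ in corpus)
```

The trade-off this sketch makes explicit is the quality-versus-quantity question the abstract raises: capping genres discards raw data in exchange for a more representative distribution.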
CamemBERT: a Tasty French Language Model
Pretrained language models are now ubiquitous in Natural Language Processing.
Despite their success, most available models have either been trained on
English data or on the concatenation of data in multiple languages. This makes
practical use of such models --in all languages except English-- very limited.
In this paper, we investigate the feasibility of training monolingual
Transformer-based language models for other languages, taking French as an
example and evaluating our language models on part-of-speech tagging,
dependency parsing, named entity recognition and natural language inference
tasks. We show that the use of web crawled data is preferable to the use of
Wikipedia data. More surprisingly, we show that a relatively small web crawled
dataset (4GB) leads to results that are as good as those obtained using larger
datasets (130+GB). Our best performing model CamemBERT reaches or improves the
state of the art in all four downstream tasks.
Comment: ACL 2020 long paper. Web site: https://camembert-model.f
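CamemBERT is a RoBERTa-style model, pretrained by masking a random subset of input tokens and training the network to recover them. A toy sketch of that masking step is given below; the 15% rate follows the BERT/RoBERTa convention, and this is not CamemBERT's actual implementation:

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, rate=0.15, seed=0):
    """Replace roughly `rate` of the tokens with a mask symbol and
    return the masked sequence plus the positions to recover."""
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * rate))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked, positions

tokens = "le fromage français est vraiment délicieux".split()
masked, positions = mask_tokens(tokens)
```

During pretraining, the model's loss is computed only at the masked positions; at fine-tuning time the mask machinery is dropped and the encoder is reused for tagging, parsing, NER or inference heads.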